Several Hive query optimization techniques are available to improve the performance of our Hive queries. In this article on Hive optimization techniques, we will learn how to tune Hive queries so that they execute faster on our cluster. The techniques covered are: the Tez execution engine, usage of a suitable file format, Hive partitioning, bucketing in Apache Hive, vectorization in Hive, cost-based optimization in Hive, and Hive indexing.
Types of Query Optimization Techniques in Hive
- Tez Execution Engine in Hive
- Usage of Suitable File Format in Hive (ORC File Format)
- Hive Partitioning
- Bucketing in Hive
- Vectorization in Hive
- Cost-Based Optimization in Hive (CBO)
- Hive Indexing
Tez Execution Engine in Hive:- We can increase the performance of our Hive queries by using Tez as the execution engine. Tez is an application framework built on Hadoop YARN that executes complex directed acyclic graphs (DAGs) of general data-processing tasks. It can be considered a much more flexible and powerful successor to the MapReduce framework. In addition, Tez offers developers an API framework for writing native YARN applications on Hadoop that bridge the spectrum of interactive and batch workloads, allowing data-access applications to work with petabytes of data over thousands of nodes.
Enabling Tez:- Tez can be enabled for the entire cluster at the bootstrap level, or for individual queries at run-time by setting set hive.execution.engine = tez.
If administrators configure Tez for the entire cluster, then individual queries can be reverted back to MapReduce by setting set hive.execution.engine = mr at the start of the job.
Usage of Suitable File Format in Hive (ORC File Format):- If we use an appropriate file format on the basis of our data, it will drastically increase our query performance. Basically, the ORC file format is best suited for increasing query performance. Here, ORC refers to Optimized Row Columnar, which implies we can store data in a more optimized way than with other file formats. To be more specific, ORC reduces the size of the original data by up to 75%; hence, data processing speed also increases. Compared to the Text, Sequence, and RC file formats, ORC shows better performance. Basically, it stores row data in groups, called stripes, along with a file footer. Therefore, we can say the ORC format improves performance when Hive is processing the data.
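As a minimal sketch, an ORC-backed copy of the employees table used later in this article could be created like this (the table name employees_orc is hypothetical):

```sql
-- Hypothetical ORC-backed copy of the employees table
CREATE TABLE employees_orc (
  emp_id INT,
  emp_salary FLOAT
)
STORED AS ORC;

-- Rewrite existing rows into the compressed columnar layout
INSERT OVERWRITE TABLE employees_orc
SELECT emp_id, emp_salary FROM employees;
```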
Hive Partitioning:- Without partitioning, Hive reads all the data in the directory and then applies the query filters to it. Since all of the data has to be read, this is slow as well as expensive. Users frequently need to filter the data on specific column values, and to apply partitioning in Hive they need to understand the domain of the data on which they are doing analysis. Basically, with partitioning, all of the entries for the various values of the partition columns are segregated and stored in their respective partitions. Hence, when we write a query to fetch values from the table, only the required partitions of the table are queried, which reduces the time taken by the query to yield the result.
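As a sketch, assuming a hypothetical department column as the partition key (the table and column names are illustrative, not from the original text):

```sql
-- Hypothetical partitioned table; each department's rows
-- are stored in their own partition directory
CREATE TABLE employees_part (
  emp_id INT,
  emp_salary FLOAT
)
PARTITIONED BY (department STRING);

-- A filter on the partition column reads only the matching
-- partition, not the whole table
SELECT emp_id, emp_salary
FROM employees_part
WHERE department = 'sales';
```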
Bucketing in Hive:- Let's suppose a scenario: a huge dataset is available, but even after partitioning on a particular field or fields, the partition sizes don't match expectations and remain huge. We still want to divide the partitioned results into more manageable parts. To solve this issue, Hive offers the bucketing concept, which allows the user to divide table data sets into more manageable parts. Through bucketing, the user can also set the number of buckets, and hence the size of these manageable parts.
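As a sketch (table and column names are hypothetical), each partition can additionally be divided into a fixed number of buckets on a chosen column:

```sql
-- Hypothetical table: rows within each partition are assigned
-- to one of 4 buckets by hashing emp_id
CREATE TABLE employees_bucketed (
  emp_id INT,
  emp_salary FLOAT
)
PARTITIONED BY (department STRING)
CLUSTERED BY (emp_id) INTO 4 BUCKETS;

-- Older Hive versions also need this setting so that inserts
-- honour the bucket definition
set hive.enforce.bucketing = true;
```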
Vectorization in Hive:- Vectorized query execution is a Hive feature that greatly reduces the CPU usage for typical query operations like scans, filters, aggregates, and joins. A standard query execution system processes one row at a time, which involves long code paths and significant metadata interpretation in the inner loop of execution. Vectorized query execution streamlines operations by processing a block of 1024 rows at a time; within the block, each column is stored as a vector (an array of a primitive data type). This feature was introduced in Hive 0.13. It significantly improves query execution time, and is easily enabled with two parameter settings:
Enable Hive vectorization:
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
When vectorization is enabled, Hive examines the query and the data to determine whether vectorization can be supported. If it cannot be supported, Hive will execute the query with vectorization turned off.
Supported Functionality
The current implementation supports only single-table read-only queries; DDL and DML queries are not supported. The supported operators are selection, filter, and group by. Partitioned tables are supported.
Cost-Based Optimization in Hive (CBO):- Before submitting a query for final execution, Hive optimizes its logical and physical execution plan. Until recently, these optimizations were not based on the cost of the query. CBO, a recent addition to Hive, performs further optimizations based on query cost, resulting in potentially different decisions: how to order joins, which type of join to perform, the degree of parallelism, and others.
To use CBO, set the following parameters at the beginning of your query:
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
Then, prepare the data for CBO by running Hive's ANALYZE command to collect various statistics on the tables for which we want to use CBO.
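For example, statistics for the employees table used later in this article could be gathered like this:

```sql
-- Table-level statistics: row count, raw data size, and so on
ANALYZE TABLE employees COMPUTE STATISTICS;

-- Column-level statistics used by the cost-based optimizer
ANALYZE TABLE employees COMPUTE STATISTICS FOR COLUMNS;
```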
Difference between Sort By and Order By
Hive uses the columns in SORT BY to sort the rows before feeding them to a reducer. The sort order depends on the column type: if the column is numeric, the sort order is numeric; if the column is a string, the sort order is lexicographical. SORT BY orders the data at each of the N reducers, but each reducer can have overlapping ranges of data.
SELECT emp_id, emp_salary FROM employees SORT BY emp_salary DESC;
Reducer-1
emp_id | emp_salary
10 | 5000
16 | 3000
13 | 2600
19 | 1800
Reducer-2
emp_id | emp_salary
11 | 4000
17 | 3100
14 | 2500
20 | 2000
As we can see, the values in each reducer's output are ordered, but total ordering is missing, since we end up with multiple outputs (one per reducer). In Hive, ORDER BY guarantees total ordering of the data, but for that all rows have to be passed to a single reducer, which is normally performance intensive. Therefore, in strict mode, Hive makes it compulsory to use LIMIT with ORDER BY so that the reducer doesn't get overburdened.
Ordering : Total Ordered data.
Outcome : Single output i.e. fully ordered.
For example :
SELECT emp_id, emp_salary FROM employees ORDER BY emp_salary DESC;
emp_id | emp_salary
10 | 5000
11 | 4000
17 | 3100
16 | 3000
13 | 2600
14 | 2500
20 | 2000
19 | 1800
Distribute By:- DISTRIBUTE BY x sends all rows with the same value of x to the same reducer, but does not sort the output of each reducer. Suppose we are distributing by x the following 5 rows to 2 reducers:
x1
x2
x4
x3
x1
Reducer 1
x1
x2
x1
Reducer 2
x4
x3
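On the employees table from the earlier examples, a DISTRIBUTE BY query might be sketched as:

```sql
-- All rows with the same emp_id go to the same reducer;
-- output within each reducer is not sorted
SELECT emp_id, emp_salary FROM employees DISTRIBUTE BY emp_id;
```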
Cluster By:- CLUSTER BY x ensures each of N reducers gets non-overlapping ranges of x, then sorts by those ranges at each reducer.
Ordering : Global ordering between multiple reducers.
Outcome : N or more sorted files with non-overlapping ranges.
For the same example as above, if we use CLUSTER BY x, the two reducers will further sort the rows on x:
Reducer 1
x1
x1
x2
Reducer 2
x3
x4
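CLUSTER BY x is shorthand for DISTRIBUTE BY x combined with SORT BY x on the same column. On the employees table from the earlier examples:

```sql
-- Equivalent to: DISTRIBUTE BY emp_salary SORT BY emp_salary
SELECT emp_id, emp_salary FROM employees CLUSTER BY emp_salary;
```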
How to load XML Data into Hive table?
Suppose the XML data to be loaded looks like this:
<CATALOG>
<BOOK>
<TITLE>Hadoop Definitive Guide</TITLE>
<AUTHOR>Tom White</AUTHOR>
<COUNTRY>US</COUNTRY>
<COMPANY>CLOUDERA</COMPANY>
<PRICE>24.90</PRICE>
<YEAR>2012</YEAR>
</BOOK>
<BOOK>
<TITLE>Programming Pig</TITLE>
<AUTHOR>Alan Gates</AUTHOR>
<COUNTRY>USA</COUNTRY>
<COMPANY>Horton Works</COMPANY>
<PRICE>30.90</PRICE>
<YEAR>2013</YEAR>
</BOOK>
</CATALOG>
We can then create a Hive table over this XML using the XML SerDe:
CREATE TABLE book_details(TITLE STRING, AUTHOR STRING, COUNTRY STRING, COMPANY STRING, PRICE FLOAT, YEAR INT)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.TITLE"="/BOOK/TITLE/text()",
"column.xpath.AUTHOR"="/BOOK/AUTHOR/text()",
"column.xpath.COUNTRY"="/BOOK/COUNTRY/text()",
"column.xpath.COMPANY"="/BOOK/COMPANY/text()",
"column.xpath.PRICE"="/BOOK/PRICE/text()",
"column.xpath.YEAR"="/BOOK/YEAR/text()")
STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES ("xmlinput.start"="<BOOK","xmlinput.end"= "</BOOK>");
Finally, add the XML SerDe jar file before running the query:
ADD JAR path/hivexmlserde-1.0.0.0.jar;